Introduction

This is a file with exercises for the ML 2020 course. The exercises are divided into two parts. This is part 1: Principal components analysis.

PCA recap

As discussed in the lecture, the PCA solution for finding \(w\) that maximizes the variance of the projected data matrix \(Xw\) is the first right singular vector of \(X\). Further directions of maximal variance are given by the subsequent singular vectors. We typically collect all of these vectors in a matrix \(W\). The scores are given by \(T=XW\), one per column. Based on these two matrices, we can approximate \(X\) by \(TW^\top\) (which is nothing more than the projection of \(X\) onto \(WW^\top\)).

Quick tips

  • To obtain the first two right singular vectors of a matrix \(A\), one can use svd(A,nu=0,nv=2). The output is a list with elements d and v. The v contain the singular vectors. These elements can be accessed with the $ key.
  • Matrices can be subsetted with the [ operator, e.g. A$vectors[,1].
  • The scores can be obtained by multiplying \(X\) and \(W\), e.g. T = X %*% W.

Exercises

  • Normal exercises
    • Show that \(TW^\top\) is “nothing more than” \(XWW^\top\)
    • Simulate two vectors x1 = rnorm(100) and x2 = rnorm(100). Apply PCA and inspect the weights of the data matrix X = cbind(x1, x2). Recall, the weights of the first component are given by svd(X,0,1)$v.
    • Now introduce correlation, for example by running x2 = x1 + rnorm(100, sd = 0.1). Now run the PCA decomposition again and inspect the weights.
      • How does correlation affect the weights?
  • Advanced exercises (may be skipped for sake of time)
    • What is wrong with the following optimization problem: \(\max_w w^\top X^\top X w\) (with \(w\) unrestricted)? What is the largest possible covariance we can obtain?
    • Formulate the optimization problem for finding the second direction \(w_2\) given the first direction \(w_1\). Hint, \(w_2\) should be orthogonal to \(w_1\).

Solutions

  • Normal exercises
    • Note that by definition, \(T = XW\), therefore \(TW^\top = XWW^\top\).
    • The weights should be random for each run, as it totally depends on the variances of the columns.
    • If the variables are correlated, the weights are much more stable. They should be around \(\sqrt{0.5}\).
  • More difficult exercises, you can safely skip it for now
    • Let \(w\) go to infinity, then the covariance gets larger and larger. Hence, the optimization problem does not admit a solution.
    • It is \(\max_w w^\top X^\top X w\) s.t. \(w^\top w = 1\) and \(w^\top w_1 = 0\)

Data analysis with Principal Components Analysis

We have transcriptomic and metabolomic measurements from a Finnish population cohort, as part of the DILGOM study. The transcriptomic measurements can be found at ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) under accession number E-TABM-1036 (E-TABM-1036.processed.1.zip). The metabolite measurements are attached as supplemental material at (Inouye et al. 2010) (msb201093-sup-0002.zip). These data are already prepared and on Github, so download the file rna_metab.RData from the following link: https://github.com/selbouhaddani/UMCU_ML2020/blob/master/rna_metab.RData

Load the data

This code chunk loads the transcriptomic and metabolomic data into memory.

load("rna_metab.RData")

There are two datasets in your environment: rna (transcripts) and metab (metabolites). Check with ls().

Inspect the data: descriptives

Packages needed

  • install.packages("gplots")

A heatmap of metabolites is plotted. There are some groups of correlated variables it seems.

gplots::heatmap.2(cor(metab,use = 'pair'), dendrogram='none', Rowv=F, Colv=F,trace='n',
                  breaks=seq(-1,1,length.out = 25), col=gplots::bluered)

Boxplots provide a good summary to compare the distribution of the variables relative to each other. Properties such as comparable means, variances and symmetry are often good to have. For sake of visualisation, we only consider the first 100 transcripts.

par(mfrow=c(2,1))
boxplot(rna[,1:100])
boxplot(metab)

par(mfrow=c(1,1))

The distributions are quite symmetric and the scale is comparable across variables in each data set.

Run PCA on the data

We perform PCA by calculating the SVD. Note that this is much faster than running eigen on the matrix \(X^\top X\), which is 7385 times 7385.

pca.rna = svd(rna, 0, 2)
pca.metab <- svd(metab, 0, 2)

par(mfrow=c(1,2))
plot(rna %*% pca.rna$v, main = "RNA PCA plot of the scores",xlab=NA,ylab=NA)
plot(metab %*% pca.metab$v, main = "Metabolites PCA plot of the scores",xlab=NA,ylab=NA)

par(mfrow=c(1,1))

No particular structure or outliers are visible.

Now to plot the weights for the two directions. Note that the RNA dataset has around 7000 weights to plot. For the metabolite weights, we add a coloring based on the type of metabolite.

library(magrittr)
library(ggplot2)
library(gridExtra)
library(OmicsPLS)
library(illuminaHumanv3.db) # BiocManager::install("illuminaHumanv3.db")
# Color names
LLmodule <- c("ILMN_1690209",'ILMN_1766551', 'ILMN_1749131', 'ILMN_1688423', 
              'ILMN_2102670', 'ILMN_1792323', 'ILMN_1899034', 'ILMN_1806721', 
              'ILMN_1695530', 'ILMN_1726114', 'ILMN_1751625', 'ILMN_1726114', 
              'ILMN_1753648', 'ILMN_1779043')
LLnr <- which(colnames(rna) %in% LLmodule)
rna_genenames <- select(illuminaHumanv3.db, 
                        keys = colnames(rna)[LLnr], 
                        keytype = "PROBEID", columns = "SYMBOL")[,2]

name_col <- 1 + sapply( #First sapply loops over column names
  X = colnames(metab),
  FUN = function(arg){
    crossprod(
      c(1, 1, 3, 4, 5), # Weights to be used as categories
      sapply(c("VLDL", "LDL", "IDL", "HDL","FA"), # metabolite classes
             function(arg2){grepl(arg2, arg)} # compare class of metabolites
      )
    )
    }
  )
name_col <- factor(name_col, 
                   levels = c(3,2,4:6,1), 
                   labels = c("VLDL", "LDL", "IDL", "HDL","FA","Other"))

# alpmetab <- loadings(fit, "Yjoint", 1:2) %>%  # Retreive loadings
#   abs %>% # Absolute loading values for positive weights
#   rowSums %>% # Sum over the components
#   sqrt + (name_col!="Other") # Take square root

######### Plot loadings with ggplot ###
p_metab <- ggplot(data.frame(x = pca.metab$v[,1], y = pca.metab$v[, 2]), aes(x = x, y = y)) + 
##################### Add all layers ###
  theme_bw() +
  coord_fixed(ratio = 1, xlim=c(-.2,.2),ylim=c(-.2,.2)) +
  geom_point( # Set color and size
    aes(col=name_col, size = I(1+(name_col%in%c("VLDL","HDL"))), 
          shape = name_col),show.legend = T) +
  theme(legend.position="right") +
  scale_color_discrete(name="Metabolite\nGroup",
                       labels=c("VLDL", "LDL", "IDL", "HDL","FA","Other")) +
  guides(size=F) + scale_shape_discrete(name="Metabolite\nGroup",
                                labels=c("VLDL", "LDL", "IDL", "HDL","FA","Other")) +
  scale_shape_manual(name="Metabolite\nGroup", values=c(15,3,4,17,5,6)) + 
  labs(title = "Metabolite joint loadings",
       x = "First Joint Loadings", y = "Second Joint Loadings") +
  theme(plot.title = element_text(face='bold'),
        legend.title=element_text(face='bold')) + 
  geom_hline(yintercept = 0) + geom_vline(xintercept = 0)

alprna <- pca.rna$v %>% raise_to_power(2) %>% rowSums
alprna[-(order(alprna,decreasing=T)[1:10])] = 0
alprna <- sign(alprna)
toprna <- which(alprna>0)
names_rna <- mapIds(illuminaHumanv3.db, 
       keys = colnames(rna)[toprna], 
       keytype = "PROBEID", 
       column = "SYMBOL",
       multiVals = 'first')
names_rna[which(is.na(names_rna))] <- "?"
######### Plot loadings with OmicsPLS plot method ###
p_rna <- ggplot(data.frame(x = pca.rna$v[, 1], y = pca.rna$v[, 2]), 
                aes(x = x, y = y),
                alpha = alprna,
                aes(label = NA)) +
    ##################### Add all layers ###
  theme_bw() +
  coord_fixed(.8, c(-.15,.15),c(-.15,.15)) +
  geom_point(alpha = 0.5, col = 'grey') +
  geom_point(data = data.frame(x = pca.rna$v[LLnr, 1], y = pca.rna$v[LLnr, 2]),
             shape = 2, col = 2, size = 2) + 
  geom_text(data = data.frame(x=pca.rna$v[toprna,1],y=pca.rna$v[toprna,2]),
            hjust = rep(c(1, 0), length.out = length(toprna)),
            aes(label = names_rna)) + 
  labs(title = "Transcript joint loadings",
       x = "First Joint Loadings", y = "Second Joint Loadings") +
  theme(plot.title = element_text(face='bold')) + 
  geom_hline(yintercept = 0) + geom_vline(xintercept = 0)

## Finally plot both plots in one figure.
grid.arrange(p_metab, p_rna, ncol=2)

Exercises

  • Why is it better to derive singular vectors from \(X\) than eigenvectors from \(X^\top X\), where \(X\) is the rna dataset?
  • Identify the most extreme expression probe with identify. To this end, first run plot(pca.rna$v).
    • To which gene does it map? Tip: use google.
    • Why does it have the largest weight?
  • What kind of clusters appear in the metabolite weights?
  • From previous analysis, it was shown that the LL module seemed to be connected to lipid metabolism. The genes in the LL module are shown as red triangles. Are they among the top genes?

Solutions

  • Here, t(rna) %*% rna is a 7385 by 7385 matrix. This does probably not fit into memory. Furthermore, with eigen you cannot limit the number of PC’s to be calculated.
  • Run identify(pca.rna$v, labels = colnames(rna)) and click on the most extreme values. Then click ESC. It is the second transcript in the dataset. When inspecting the boxplots of the first 10 genes, it appears that this gene has a high variance.
  • Roughly speaking: VLDL and HDL.
  • No. Because PCA does not look at the relation between rna and metab.

References

Inouye, Michael, Johannes Kettunen, Pasi Soininen, Kaisa Silander, Samuli Ripatti, Linda S Kumpula, Eija Hämäläinen, et al. 2010. “Metabonomic, transcriptomic, and genomic variation of a population cohort.” Mol. Syst. Biol. 6 (441): 441. https://doi.org/10.1038/msb.2010.93.